Abstract

We solve the problem of finding interspersed maximal repeats using a suffix array
construction. As is well known, all the functionality of suffix trees can be handled by
suffix arrays, gaining practicality. Our solution improves on the suffix tree based approaches
to the repeat finding problem, being particularly well suited for very large inputs. We
prove the correctness and complexity of the algorithms.

1 Introduction

Many basic problems on strings have been solved in the past using the suffix tree data structure
because of its theoretical linear complexity bounds [4]. However, suffix trees proved to be
impractical when handling very large inputs, as needed in comparative genomics or web
indexing. The linear time and memory bounds of suffix trees hide a large constant factor, so
many of the algorithms based on suffix trees were superseded by algorithms based on suffix
arrays, see [7, 14, 18].

In this paper we focus on the classical problem of finding repeats in a given string, and we 
give a solution based on the suffix array. This is the kind of algorithm that everyone believes
must have already been done. Indeed, we published our algorithm without a formal analysis
in 2009 in [1]. The algorithm has been recently cited as the state of the art for repeat finding 
with suffix arrays [10]. The purpose of the present paper is to document the correctness of 
the algorithm and to give an analysis of its time and memory complexity. In addition, the 
version of the algorithms we give here contains some improvements in memory usage.

We consider two variants of the repeat finding problem. The first one is to find all maximal 
repeats in a given string, where a maximal repeat is a substring that occurs at least twice, 
and all its extensions occur fewer times. The second is the problem of finding all the maximal 
substrings in a given string that occur more than once. Using the terminology of Gusfield [4],
we call them supermaximal repeats.



2 Notation and preliminary definitions 

Notation. Assume the alphabet A, a finite set of symbols. A string is a finite sequence of
symbols in A. The length of a string w, denoted by |w|, is the number of symbols in w. We
address the positions of a string w by counting from 1 to |w|. The symbol in position i is
denoted w[i], and w[i..j] represents the substring that starts in position i and ends in position
j of w, inclusive. A prefix of a string w is an initial segment of w, w[1..i]. A suffix of a string
w is a final segment of w, w[i..|w|]. We say u is a substring of w if u = w[i..j] for some i, j.
If u is a substring of w we say that u occurs in w at position i if u = w[i..i + |u| - 1]. When
u is a substring of w, we say that w is an extension of u.

Definition 1 (Maximal repeat, [4]). A maximal repeat of a string w is a string that occurs 
more than once in w, and each of its extensions occurs fewer times. 

Example 2. The set of maximal repeats of w = abcdeabcdfbcde is {abcd, bcde, bcd}. Clearly
abcd and bcde are maximal repeats occurring twice. But also bcd is a maximal repeat because
it occurs three times in w, and every extension of bcd occurs fewer times. There are no other
maximal repeats in w (bc, for example, occurs three times, but since bcd occurs the same
number of times, bc is not a maximal repeat).
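
Definition 1 can be checked directly on small inputs. The following is a minimal brute-force sketch, not the algorithm of this paper: it counts every substring and keeps those whose one-symbol extensions all occur fewer times (any longer extension contains a one-symbol extension, so checking those suffices). The function name is an assumption made for illustration.

```python
from collections import Counter

def maximal_repeats_bruteforce(w):
    """Return {repeat: occurrences} per Definition 1 (quadratic space, illustration only)."""
    # Count every substring once per starting position (overlaps included).
    counts = Counter(w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1))
    alphabet = set(w)
    result = {}
    for u, k in counts.items():
        if k < 2:
            continue
        # u is maximal if every one-symbol extension, left or right, occurs fewer times.
        if all(counts[c + u] < k and counts[u + c] < k for c in alphabet):
            result[u] = k
    return result

print(maximal_repeats_bruteforce("abcdeabcdfbcde"))
```

On the string of Example 2 this yields abcd and bcde with two occurrences each and bcd with three.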

This example already shows that maximal repeats can be nested and overlapping. A bound
on the number of maximal repeats in a string is already given in [4].

Theorem 3 ([4], Theorem 7.12.1). The number of maximal repeats in a string of length n is 
no greater than n. 

This result can also be derived from Algorithm 1 and Theorem 11. 

3 An algorithm to find all maximal repeats 

Let w be a string of length n = |w|. The suffix array ([16]) of w is a permutation r of the
indices 1..n such that for each i < j, w[r[i]..n] is lexicographically less than w[r[j]..n]. Thus,
a suffix array represents the lexicographic order of all suffixes of the input w. For convenience
we also store the inverse permutation of r and call it p, namely, p[r[i]] = i. As a first step
of our procedure we use the fast algorithm of [14] to build the suffix array of the input w in
time O(n log n).
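
To make the two arrays concrete, here is a small sketch that builds r and p with the 1-based indexing used in the text. It sorts the suffixes naively, which is an assumption made for brevity and does not achieve the O(n log n) bound of the construction in [14]; slot 0 of each list is unused padding.

```python
def suffix_array(w):
    """1-based suffix array: r[i] is the start of the i-th smallest suffix; r[0] is unused."""
    n = len(w)
    # Naive construction: sort start positions by the suffix they address.
    return [0] + sorted(range(1, n + 1), key=lambda s: w[s - 1:])

def inverse_permutation(r):
    """p[r[i]] = i for 1 <= i <= n; p[0] is unused."""
    p = [0] * len(r)
    for i in range(1, len(r)):
        p[r[i]] = i
    return p

r = suffix_array("banana")
p = inverse_permutation(r)
print(r[1:], p[1:])
```

For banana the sorted suffixes are a, ana, anana, banana, na, nana, so r lists the start positions 6, 4, 2, 1, 5, 3.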

We can think of each substring of w as a prefix of a suffix of w. Suppose a maximal repeat
u occurs k times in w; then, it is a prefix of k different suffixes of w. Since the suffix array r
records the lexicographical order of the suffixes of w, the maximal repeat u can be seen as a
string of length |u| addressed by k consecutive indices of r. Namely, there will be an index
i such that u occurs at positions r[i], r[i + 1], ..., and r[i + k - 1] of w. The algorithm has to
identify which strings addressed by consecutive indices of the suffix array are indeed maximal 
repeats. 

We compute the length of the longest common prefix of each pair of consecutive suffixes 
in the lexicographic order. For this task we use the linear time algorithm of Kasai et al. [8]. 
For any position 1 ≤ i < n, LCP[i] gives the length of the longest common prefix of w[r[i]..n]
and w[r[i + 1]..n].
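
The linear-time idea behind this computation can be sketched as follows. This is a variant written for the convention above (LCP[i] compares the suffix of rank i with the one of rank i + 1), so it is not a transcription of the algorithm in [8]; the names and the unused slots 0 and n are assumptions of the sketch. Suffixes are processed in text order, and the match length carried over from one suffix to the next drops by at most one.

```python
def lcp_array(w, r):
    """lcp[i] = longest common prefix of the suffixes of rank i and i + 1 (1 <= i < n).

    w: input string; r: 1-based suffix array with r[0] unused.
    Slots lcp[0] and lcp[n] are unused and left at 0.
    """
    n = len(w)
    p = [0] * (n + 1)
    for i in range(1, n + 1):
        p[r[i]] = i
    lcp = [0] * (n + 1)
    h = 0  # number of matching characters carried over from the previous suffix
    for s in range(1, n + 1):          # s: 1-based start of a suffix, in text order
        if p[s] == n:                  # the last suffix in rank has no successor
            h = 0
            continue
        t = r[p[s] + 1]                # start of the next suffix in lexicographic order
        while s + h <= n and t + h <= n and w[s + h - 1] == w[t + h - 1]:
            h += 1
        lcp[p[s]] = h
        if h > 0:
            h -= 1
    return lcp

w = "banana"
r = [0] + sorted(range(1, len(w) + 1), key=lambda s: w[s - 1:])
print(lcp_array(w, r)[1:len(w)])
```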

Definition 4. A substring of a given string is maximal to the right when all its extensions 
to the right occur fewer times; similarly for maximality to the left. 



This definition allows for a characterization of maximal repeats that we use in Algorithm 1.

Proposition 5. A substring of a given string is a maximal repeat if and only if it is maximal 
to the left and to the right. 

To identify maximal repeats we first identify the candidates that are maximal to the right, 
and then we filter out those that are not maximal to the left. The next propositions assume 
the suffix array r of the input string w, its inverse permutation p, and the LCP array. 

Proposition 6. A substring u of w is repeated and maximal to the right if and only if there
is an index i, 1 ≤ i < n, and a number of occurrences k ≥ 2 such that

1. u = w[r[i]..r[i] + |u| - 1];

2. u occurs exactly k times in w,

   - for each t ∈ [i, i + k - 2], LCP[t] ≥ |u|,

   - LCP[i - 1] < |u| or i = 1,

   - LCP[i + k - 1] < |u| or i + k - 1 = n,

3. There is t ∈ [i, i + k - 2] such that LCP[t] = |u|.

Proof. We first prove the implication to the right. Assume u occurs more than once in
w, and let k be its number of occurrences. Let i = min{t : w[r[t]..r[t] + |u| - 1] = u}
and j = max{t : w[r[t]..r[t] + |u| - 1] = u}. By definition of r, for each t such that
i < t < j, w[r[i]..n] is lexicographically before w[r[t]..n] which in turn is lexicographically
before w[r[j]..n]. Since the first |u| characters of the suffixes addressed by i and j are the
same, for every such t, w[r[t]..r[t] + |u| - 1] = u. Therefore, the longest common prefix of any
two of these strings has length at least |u|. Since j is the largest index and i is the smallest, for
any other index h outside the range of i..j, either n - r[h] < |u| or w[r[h]..r[h] + |u| - 1] ≠ u,
therefore the longest common prefix between w[r[i]..n] and any of the suffixes outside the
interval addressed by i..j is necessarily less than |u|. Finally, since u is maximal to the right,
there must be at least two occurrences of u having different extensions to the right. This
causes a value in the LCP to be exactly |u|.

The reverse implication is implied by the following observation. Let u = w[r[i]..r[i] + |u| - 1]
and assume all strings w[r[t]..r[t] + |u| - 1] with i ≤ t < j = i + k share its first |u| characters,
so they all start with u. Assume there is some t, i ≤ t < i + k - 1 such that LCP[t] = |u|;
therefore, either n - max{r[t], r[t + 1]} = |u| or w[r[t] + |u|] ≠ w[r[t + 1] + |u|]. In either
case, the repeat cannot be extended to the right. Since the previous and next strings outside
the interval addressed by i..j, if they exist, have a longest common prefix of length less than
|u|, they do not start with u. By definition of r, every suffix starting with u has to be
included in the considered interval. Hence, every extension of u to the right occurs fewer
times than u itself. □

Proposition 7. A substring u of w maximal to the right is also maximal to the left if and
only if there is an index i, 1 ≤ i < n, and a number of occurrences k ≥ 2 such that

1. u occurs k times in w,

2. w[r[i] - 1] ≠ w[r[i + k - 1] - 1], or
   r[i] = 1, or r[i + k - 1] = 1, or
   p[r[i + k - 1] - 1] - p[r[i] - 1] ≠ k - 1.



Proof. We use the characterization of maximality to the right given in Proposition 6. Assume
u is maximal to the right, let i be its first index in r and let k be its number of occurrences.
We first show the implication to the left by proving the contrapositive. Suppose u is not
maximal to the left, then there is a symbol c such that cu occurs the same number of times
as u. All occurrences of cu also contain u, hence for each of the k occurrences of u there is
a c before u. Since u occurs at positions r[i], ..., r[i + k - 1] in w, cu must occur at positions
r[i] - 1, ..., r[i + k - 1] - 1. Thus, r[i] ≠ 1, r[i + k - 1] ≠ 1 and w[r[i] - 1] = w[r[i + k - 1] - 1] = c.
Observe that the relative order in r of the suffixes starting with u is the same as the relative
order of all the suffixes starting with cu, because they are the same strings, with a c added
at the front. So, for each j = 0..k - 1, p[r[i + j] - 1] = p[r[i] - 1] + j, and in particular,
p[r[i + k - 1] - 1] = p[r[i] - 1] + k - 1.

For the implication to the right assume u is repeated, maximal to the right and maximal
to the left. Since u occurs k times, k ≥ 2, there is an index i such that for every t, i ≤
t ≤ i + k - 1, u = w[r[t]..r[t] + |u| - 1]. Assume r[i] ≠ 1, and r[i + k - 1] ≠ 1, and
p[r[i + k - 1] - 1] - p[r[i] - 1] = k - 1. Since u is maximal to the left, there must be at
least two positions a and b of w witnessing that u cannot be extended to the left. Either
w[a - 1] ≠ w[b - 1] or a = 1 or b = 1 (recall the positions of a string are numbered starting at
1), while u = w[a..a + |u| - 1] = w[b..b + |u| - 1]. But a = r[t] and b = r[t'] for some indices
t, t' in the range i..i + k - 1. Since r records the lexicographic ordering, if that happens on any
pair of positions t, t', i ≤ t < t' ≤ i + k - 1, it also happens choosing t = i and t' = i + k - 1,
therefore, if none of those indices is 1, w[r[i] - 1] ≠ w[r[i + k - 1] - 1]. □

We call the algorithm findmaxr. Its pseudocode is described in Algorithm 1. It takes as 
an extra parameter an integer ml which is the minimum length of a maximal repeat to be 
reported (it can be set to 1 if desired). 

The idea of the algorithm is to treat all suffix intervals as described in Proposition 6, one 
by one, but in non-decreasing order of their longest common prefix. We use a set S to keep 
track of the LCP values already seen at each current step of the algorithm, and the fake 
indices 0 and n to treat border cases. The algorithm first discards all values less than ml,
inserting them into S, hence, the required computational time in practice diminishes as ml
increases.

The main loop of the algorithm iterates on the LCP values in non-decreasing order. For 
each current value, the algorithm finds the largest interval of indices in the suffix array r, 
above and below it, such that all the contained LCP values are greater than or equal to the 
current one. Being the largest interval ensures that all occurrences of the addressed string 
are taken into account. Then the algorithm checks whether this interval implies a string 
that is maximal to the right and to the left. Since the main loop iterates on LCP values in 
non-decreasing order, each current interval of indices in r lies between the closest previously
used indices above and below it. Notice that the interval is only valid when the limits are
strictly less than (and not equal to) the current LCP value. Each of the intervals considered in
Proposition 6 is addressed at the first visited of all j such that LCP[j] = |u|.

The algorithm keeps track of the set S of indices already seen; these are the indices of 
LCP that have already been treated in the main loop of the algorithm. The data structure 
to implement this set S should be efficient for the insertion operation in S, as well as the 
queries for the "minimum greater than a given value" and the "maximum less than a given 
value". Notice that these are the most expensive operations of the main loop, being critical 
for the overall time complexity. All other operations in the main loop are O(1).



Algorithm 1 findmaxr(w, ml)

n := |w|
r := suffix array of w
p := inverse permutation of r
LCP := longest common prefix array of w and r
S := {u : LCP[u] < ml} ∪ {0, n}   -- discard all indices of w whose LCP is less than ml --
I := permutation of [0, n - 1] such that LCP[I[i]] ≤ LCP[I[i + 1]]
initial := min{t : LCP[I[t]] ≥ ml}
for t := initial to n - 1 do
    i := I[t]
    pi := max{j : j ∈ S ∧ j < i} + 1
    ni := min{j : j ∈ S ∧ j > i}
    S := S ∪ {i}
    if (pi = 1 or LCP[pi - 1] ≠ LCP[i]) and (ni = n or LCP[ni] ≠ LCP[i]) then
        -- here we have a substring maximal to the right, check if it is maximal to the left --
        if r[pi] = 1 or r[ni] = 1 or w[r[pi] - 1] ≠ w[r[ni] - 1] or p[r[ni] - 1] - p[r[pi] - 1] ≠ ni - pi then
            -- here it is both maximal to the right and to the left --
            report the maximal repeat w[r[i]..r[i] + LCP[i] - 1] and its ni - pi + 1 occurrences,
            whose list of positions in w are contiguous in r starting at pi.
        end if
    end if
end for
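
The main loop of Algorithm 1 can be followed in a short runnable sketch. It keeps the 1-based indexing of the pseudocode but makes three simplifying assumptions that are not part of the paper: the suffix array is built by naive sorting, the LCP array by direct comparison, and the set S is a sorted Python list queried with `bisect` rather than the bit tree described next. The asymptotic bounds therefore do not carry over; only the logic of the loop does.

```python
from bisect import bisect_left

def findmaxr(w, ml=1):
    """Return [(repeat, sorted 1-based positions)] in non-decreasing order of length."""
    n = len(w)
    r = [0] + sorted(range(1, n + 1), key=lambda s: w[s - 1:])   # naive suffix array
    p = [0] * (n + 1)
    for i in range(1, n + 1):
        p[r[i]] = i

    def common_prefix(a, b):                                     # naive LCP of two suffixes
        h = 0
        while a + h <= n and b + h <= n and w[a + h - 1] == w[b + h - 1]:
            h += 1
        return h

    lcp = [0] * (n + 1)
    for i in range(1, n):
        lcp[i] = common_prefix(r[i], r[i + 1])

    # S holds the fake indices 0 and n plus every index discarded by ml.
    S = sorted({0, n} | {i for i in range(1, n) if lcp[i] < ml})
    order = sorted((i for i in range(1, n) if lcp[i] >= ml), key=lambda i: lcp[i])
    found = []
    for i in order:
        k = bisect_left(S, i)          # S[k - 1] < i < S[k], since i is not yet in S
        pi, ni = S[k - 1] + 1, S[k]
        S.insert(k, i)
        if (pi == 1 or lcp[pi - 1] != lcp[i]) and (ni == n or lcp[ni] != lcp[i]):
            # maximal to the right; w[x - 2] is the symbol before 1-based position x
            if (r[pi] == 1 or r[ni] == 1
                    or w[r[pi] - 2] != w[r[ni] - 2]
                    or p[r[ni] - 1] - p[r[pi] - 1] != ni - pi):
                u = w[r[i] - 1:r[i] - 1 + lcp[i]]
                found.append((u, sorted(r[pi:ni + 1])))
    return found

print(findmaxr("abcdeabcdfbcde"))
```

On the string of Example 2 the sketch reports bcd at positions 2, 7, 11, then abcd at 1, 6 and bcde at 2, 11.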



Algorithm 2 Maximum less than(t)

-- inquire the bit tree S --
repeat
    if t is a left child then
        t := rightmost node to the left of t in its level
    else
        t := parent(t)
    end if
until node t is set to 1
while t is not a leaf do
    if right child(t) is set to 1 then
        t := right child(t)
    else
        t := left child(t)
    end if
end while
return t



We represent the set S with a small data structure, which is efficient for the mentioned
operations ensuring they take time O(log n). This data structure requires the number
of elements of the universe to be specified in advance. In our case this is not a problem, because
the universe is the integer interval [0, n + 1]. We implement S as a binary tree of bits with
n + 2 leaves. Each leaf represents one of the n + 2 elements of the universe; the value is 1 if
the element is in S, and it is 0 otherwise. An internal node has value 0 if and only if both of
its children are 0, otherwise it has value 1. The insertion operation only needs to update the
branch of the modified leaf, so it can be done in O(log n) time.

Algorithm 2 gives the pseudocode to find the maximum less than a given t. The idea is 
to go up in the tree always moving to the rightmost node that has a chance of being set to 1 
because of a leaf to the left of the parameter t; this rapidly increases the interval represented 
by the nodes, because each move up multiplies the size of the checked interval by at least 3/2. 
As soon as we find a node with value 1, we immediately go down in the tree looking for the 
rightmost leaf that caused that 1. Finding the minimum greater than t is analogous. 
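
A tree of bits with these three operations can be sketched as below. The layout is an assumption chosen for compactness (a single array in heap order, padded to a power of two, where node v has children 2v and 2v + 1), and the climbing rule is a common variant rather than a line-by-line rendering of Algorithm 2: the query climbs until it stands on a right child whose left sibling is set, then descends to the rightmost set leaf.

```python
class BitTree:
    """Set over {0, ..., size - 1} with O(log size) insert, predecessor and successor."""

    def __init__(self, size):
        self.leaves = 1
        while self.leaves < size:
            self.leaves *= 2
        self.bits = bytearray(2 * self.leaves)   # bits[1] is the root; bits[0] is unused

    def insert(self, x):
        v = self.leaves + x
        while v >= 1 and not self.bits[v]:       # ancestors of a set node are already set
            self.bits[v] = 1
            v //= 2

    def max_less_than(self, x):
        v = self.leaves + x
        while v > 1:
            if v % 2 == 1 and self.bits[v - 1]:  # right child with a non-empty left sibling
                v -= 1
                break
            v //= 2
        else:
            return None                          # reached the root: nothing smaller is set
        while v < self.leaves:                   # descend, preferring the right child
            v = 2 * v + 1 if self.bits[2 * v + 1] else 2 * v
        return v - self.leaves

    def min_greater_than(self, x):
        v = self.leaves + x
        while v > 1:
            if v % 2 == 0 and self.bits[v + 1]:  # left child with a non-empty right sibling
                v += 1
                break
            v //= 2
        else:
            return None
        while v < self.leaves:                   # descend, preferring the left child
            v = 2 * v if self.bits[2 * v] else 2 * v + 1
        return v - self.leaves

S = BitTree(16)
for x in (0, 15, 4, 9):
    S.insert(x)
print(S.max_less_than(7), S.min_greater_than(7))
```

Each query moves up at most one level per step and then down at most one level per step, which gives the logarithmic bound.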

3.1 Improved implementation saving n word_size memory 

We give an alternative implementation of Algorithm 1 that lowers the needed memory to
n (3 word_size + log |A| + 2) bits, while keeping the same time complexity, cf. Theorems 13 and 14.
In this variant we get rid of the resident LCP array before entering into the main loop, and
we efficiently deal with the three LCP values needed inside the main loop.

Algorithm 3 gives the pseudocode. It starts exactly as Algorithm 1, first computing the
suffix array r, the LCP array and the set S. Then it builds the array I using the simple
algorithm "Inverse a permutation in place" of [9], Algorithm I, Section 1.3.3, attributed there to
[6]. This inversion requires time O(n) and n auxiliary bits. However, the LCP array is not
a permutation of 1..n, so we first map the LCP into a permutation of 1..n, using an auxiliary
array. At this point we can free the LCP space, and the array I is constructed in the same
place as the auxiliary permutation.

Both Algorithm 1 and Algorithm 3 iterate in their main loop over the indices in non-decreasing
order of the LCP values. However, Algorithm 3 requires that in case of a tie of the LCP values
the indices be iterated in increasing order. This ordering is ensured when we construct the
array I. The rest of the implementation is exactly as Algorithm 1.
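
The two steps described above, mapping the LCP values to a permutation and then inverting it in place, can be sketched as follows. The code uses 0-based lists, a counting pass to assign ranks (ties broken by increasing index), and a cycle-following inversion with one mark bit per element; it follows the idea of the cited in-place inversion but is not a transcription of it, and both function names are assumptions.

```python
def lcp_ranks(lcp):
    """rank[i] = position of index i when indices are sorted by (lcp[i], i)."""
    m = max(lcp, default=0)
    start = [0] * (m + 2)
    for v in lcp:
        start[v + 1] += 1
    for v in range(1, m + 2):
        start[v] += start[v - 1]      # start[v] = number of entries smaller than v
    rank = [0] * len(lcp)
    for i, v in enumerate(lcp):       # increasing i, so ties keep index order
        rank[i] = start[v]
        start[v] += 1
    return rank

def invert_in_place(perm):
    """Replace a permutation of 0..n-1 by its inverse, using n auxiliary mark bits."""
    done = bytearray(len(perm))
    for start in range(len(perm)):
        prev, cur = start, perm[start]
        while not done[cur]:          # walk the cycle, reversing each arrow
            nxt = perm[cur]
            perm[cur] = prev
            done[cur] = 1
            prev, cur = cur, nxt

lcp = [4, 0, 4, 3, 0, 3, 2, 0, 2, 1, 0, 1, 0]   # LCP values of Example 2, 0-based
I = lcp_ranks(lcp)
invert_in_place(I)                               # I[t] is now the index visited at step t
print(I)
```

After the inversion, I lists the indices in non-decreasing LCP order with ties in increasing index order, which is the iteration order Algorithm 3 relies on.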

Proposition 8. The total memory required by Algorithm 3 sums up to the original input w,
the set S, and three arrays of length n with values between 0 and n.

Proof. The input w, the set S and the suffix array r are permanently in memory. Besides these, there
are at most two other integer arrays of length n in memory: first the LCP and the auxiliary
permutation array. Then LCP is discarded and array I is constructed in place of the auxiliary
permutation using n auxiliary bits. Then, these auxiliary bits are discarded and array p is
constructed. □

The total number of instructions involved in the whole computation of the main loop of
Algorithm 3 and those in the main loop of Algorithm 1 differ in O(1). Therefore, the overall
time complexity of the two variants of findmaxr is the same.

Proposition 9. The overall computation of all the needed LCP values in the main loop of
Algorithm 3 requires at most O(n) comparisons.



Algorithm 3 findmaxr(w, ml)

n := |w|
r := suffix array of w
p := inverse permutation of r
LCP := longest common prefix array of w and r
S := {u : LCP[u] < ml} ∪ {0, n}   -- discard all indices of w whose LCP is less than ml --
I := permutation of [0, n - 1] such that LCP[I[i]] ≤ LCP[I[i + 1]]
initial := min{t : LCP[I[t]] ≥ ml}
lasti := -1
lastLCP := 0
curLCP := 0
for t := initial to n - 1 do
    i := I[t]
    while w[r[i] + curLCP] = w[r[i + 1] + curLCP] do
        curLCP := curLCP + 1
    end while
    pi := max{j : j ∈ S ∧ j < i} + 1
    ni := min{j : j ∈ S ∧ j > i}
    S := S ∪ {i}
    if (pi = 1 or lasti ≠ pi - 1 or lastLCP ≠ curLCP) then
        -- here we have a substring maximal to the right, check if it is maximal to the left --
        if r[pi] = 1 or r[ni] = 1 or w[r[pi] - 1] ≠ w[r[ni] - 1] or p[r[ni] - 1] - p[r[pi] - 1] ≠ ni - pi then
            -- here it is both maximal to the right and to the left --
            report w[r[i]..r[i] + curLCP - 1] and its ni - pi + 1 occurrences, whose list of positions
            in w are contiguous in r starting at pi.
        end if
    end if
    lastLCP := curLCP
    lasti := i
end for



Proof. The main loop of Algorithm 1 uses three values in the LCP array. In Algorithm 3
we compute the needed values, profiting from the fact that indices i are visited in non-decreasing order
of their LCP values, and those having the same LCP value are visited in increasing order
(this ordering is achieved when we construct array I). To compute LCP[i] we compare
the two suffixes character by character but starting from the previously used LCP value.
Hence, the total number of character comparisons in the overall computation of all the LCP
values is at most 2n (there can be at most n comparisons that yield each of the 2 possible
results). For the comparison LCP[pi - 1] ≠ LCP[i]: Since pi - 1 = max{j : j ∈ S ∧ j < i}, the
index pi - 1 was already seen, so its LCP value is no greater than LCP[i]. In case LCP[i]
differs from the LCP value of the last index seen, given that indices are visited in non-decreasing
order of their LCP value, and pi - 1 ∈ S, hence it was already seen, we conclude
LCP[pi - 1] ≠ LCP[i]. In case LCP[i] equals the LCP value of the last index seen, since
indices having the same LCP value are visited in increasing order, LCP[pi - 1] = LCP[i]
exactly when pi - 1 coincides with the last index seen, because pi - 1 is the largest seen index
smaller than i. For the comparison LCP[ni] ≠ LCP[i]: Since indices with the same
LCP value are visited in increasing order, and ni = min{j : j ∈ S ∧ j > i}, namely, ni is the
smallest seen index greater than i, then necessarily LCP[ni] < LCP[i]. Thus, the inequality
LCP[ni] ≠ LCP[i] is always true. □
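
The counting argument for the character comparisons can be checked with a small sketch. It recomputes each LCP value by resuming from the previous one, which is valid because the indices arrive in non-decreasing LCP order, and it counts comparisons. The 0-based indexing, the explicit bounds check and the helper names are assumptions of this sketch; the visiting order is obtained here from a naively computed LCP array purely for the demonstration.

```python
def lazy_lcp(w, r, order):
    """Recompute LCP values along `order`, resuming from the previous value.

    w: input string; r: 0-based suffix array; order: indices i with 0 <= i < n - 1,
    sorted by non-decreasing LCP[i]. Returns ({i: LCP[i]}, number of comparisons).
    """
    n = len(w)
    cur = 0                      # plays the role of curLCP
    comparisons = 0
    values = {}
    for i in order:
        a, b = r[i], r[i + 1]
        while a + cur < n and b + cur < n:
            comparisons += 1
            if w[a + cur] != w[b + cur]:
                break            # at most one failed comparison per index
            cur += 1             # successful comparisons never exceed the largest LCP
        values[i] = cur
    return values, comparisons

def naive_lcp(w, a, b):
    h = 0
    while a + h < len(w) and b + h < len(w) and w[a + h] == w[b + h]:
        h += 1
    return h

w = "abcdeabcdfbcde"
n = len(w)
r = sorted(range(n), key=lambda s: w[s:])
full = [naive_lcp(w, r[i], r[i + 1]) for i in range(n - 1)]
order = sorted(range(n - 1), key=lambda i: full[i])
values, comparisons = lazy_lcp(w, r, order)
print(comparisons, 2 * n)
```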

3.2 Correctness of algorithm findmaxr 

Theorem 10 (Correctness). The algorithm findmaxr computes all occurrences of all maximal
repeats in the input string, in increasing order of length.

Proof. Consider Algorithm 1. The main loop sequentially accesses the array I, the permutation
array for the non-decreasing order of LCP. By Proposition 5 maximal repeats are exactly
the candidates that are maximal to the right and to the left. Propositions 6 and 7 fully
characterize these properties with conditions on the data structures LCP, r, and p. The
algorithm checks these conditions. □

3.3 Complexity of algorithm findmaxr 

As is customary in the literature on algorithms we express the time and space complexity assuming
integer values can be stored in a unit, and integer additions and multiplications can
be done in O(1). These assumptions make sense because the integer values involved in the
algorithm fit into the processor word size for practical cases. Although the algorithm is scalable
for any input size, the derived complexity bounds are guaranteed only if the input size
remains under the machine addressable size. Otherwise, the classical logarithmic complexity
charge for each integer operation becomes mandatory.

Assume an input size of n symbols. To bound the time complexity of our main algorithm, 
we first show that the set of all maximal repeats and their occurrences can be represented in 
a concise way. 

Theorem 11. For any string w of length n, the set of all maximal repeats and all their
occurrences is representable in space O(n).

Proof. Each iteration of the main loop of Algorithm 1 reports at most one maximal repeat,
followed by the list of all its occurrences. Each reported maximal repeat and all its occurrences
can be represented with three unsigned integers: an index i in the suffix array r, a length ℓ,
and the number of occurrences m. The reported maximal repeat is the prefix of length ℓ of
the suffix at position r[i]. Its m occurrences are respectively in positions r[i], ..., r[i + m - 1].
Each of these integers is at most n (where n is at most the maximum addressable memory)
and we need n of them. Assuming that these integer values can be stored in a fixed number
of bits, this output requires size O(n). Finally, we need to store the suffix array r, which
contains n integer values that are a permutation of 1..n, so it also requires O(n). The input
w also takes O(n) space, since each symbol in A also takes O(1) because |A| < n. □

Of course, if instead of charging a fixed number of bits to store an integer, we count the
length of its bit representation, the total needed output space to report all maximal repeats
and occurrences in a given input string of length n becomes O(n log n). The input w in this
case would have the same O(n log n) bound, but probably takes a lot less because alphabet
sizes are usually small compared with n.

Maximum_less_than and Minimum_greater_than take O(log n) time. We prove one, the
other is analogous.

Proposition 12. The time complexity of Maximum_less_than a given value is O(log n).

Proof. In the repeat loop of Algorithm 2 there is at most one move to the right for each
move up in the tree, hence, we have in total O(log n) iterations. Then, in the while loop,
every move goes down one level, therefore there are also O(log n) total moves. If the tree is
implemented over a bit array (in our implementation we use a bit array for each level in the
tree), all moves are easily implemented in O(1); thus, the entire running time of each query
is O(log n). □

Theorem 13 (Time Complexity). The algorithm findmaxr takes time O(n log n).

Proof. Consider Algorithm 1 or its variant as Algorithm 3 together with Proposition 9. All
steps before the main for loop take O(n log n) operations. The main loop iterates n times.
The most expensive procedure performed in the loop body is the manipulation of the tree for
the set S, which requires at most O(log n) operations, cf. Proposition 12. □

Theorem 14 (Space complexity). For an input of size n, algorithm findmaxr uses O(n)
memory space. More precisely, it uses n (3 word_size + log |A| + 2) + O(1) bits.

Proof. Consider Algorithm 3. The whole input w is allocated in memory. Since it contains
n symbols, its memory usage is n log |A| bits. By Proposition 8 the total amount of memory
needed is for the input w, the set S and three integer arrays of length n with values between
0 and n. The described tree for the set S has 2n + 1 nodes, implemented with an array of
2n + 1 bits. Then, the exact memory space of all the data structures is n (3 word_size +
log |A| + 2) bits. The local variables are counted in the O(1) term. □

4 An algorithm to find all supermaximal repeats 

We solve here the problem of finding the maximal substrings of the input that are repeated. 
These are a subset of the maximal repeats (in the sense of Definition 1) in the input string 
such that none of their extensions is also a maximal repeat, called supermaximal repeats. 
For instance, in Example 2 the set of all maximal repeats in string w = abcdeabcdfbcde is
{abcd, bcde, bcd}, while the set of supermaximal repeats is just {abcd, bcde} since bcd is a
substring of abcd (and also of bcde).

Definition 15 (Supermaximal repeats, [4]). A string u is a supermaximal repeat in w if u is
a substring that occurs at least twice in w and each extension of u occurs at most once in w.
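
As with maximal repeats, the definition can be tested by brute force on small strings. The sketch below is only an illustration of Definition 15, not the algorithm of this section; it relies on the fact that every extension contains a one-symbol extension, so it is enough to check those. The function name is an assumption.

```python
from collections import Counter

def supermaximal_repeats_bruteforce(w):
    """Return the sorted list of supermaximal repeats of w (illustration only)."""
    counts = Counter(w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1))
    alphabet = set(w)
    return sorted(
        u for u, k in counts.items()
        # repeated, and no one-symbol extension is itself repeated
        if k >= 2 and all(counts[c + u] <= 1 and counts[u + c] <= 1 for c in alphabet)
    )

print(supermaximal_repeats_bruteforce("abcdeabcdfbcde"))
```

For the string of Example 2 only abcd and bcde remain, since bcd has the repeated extension abcd.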

Algorithm 4, called findsmaxr, finds all supermaximal repeats in a given string w. It 
is similar to the algorithm findmaxr of the previous section, but simpler. We define an 
analogous notion of maximality to the right and to the left for this case. 

Definition 16. A substring u that occurs more than once in w is supermaximal to the right
if all of its extensions to the right occur at most once in w. Supermaximality to the left is
defined analogously.



Algorithm 4 findsmaxr(w, ml)

n := |w|
r := suffix array of w
LCP := longest common prefix array of w and r
up := 1
for i := 2 to n - 1 do
    if LCP[i] > LCP[i - 1] then
        up := i   -- starting position in LCP for the set of local maximum values --
    else
        if LCP[i] ≠ LCP[i - 1] ∧ LCP[i - 1] ≥ ml then
            -- the indices from up to i - 1 give a local maximum in LCP of appropriate length --
            if #{w[r[j] - 1] : up ≤ j ≤ i ∧ r[j] > 1} = i - up + 1 then
                -- check that all previous characters are different --
                report i - up + 1 supermaximal repeats of size LCP[up] whose
                list of positions in w are contiguous in r starting at up.
            end if
        end if
    end if
end for



Proposition 17. A substring of w is supermaximal to the right if and only if it is addressed
by consecutive indices in the suffix array denoting a local maximum in LCP.

Proof. A substring u of w is supermaximal to the right, cf. Definition 16, exactly when
there is a set of at least two consecutive suffixes addressed by r such that all the addressed
suffixes have the same maximal common prefix, which is longer than that of any other
two suffixes, one taken in the set and one outside the set (note that any pair outside the set
will have a common prefix of the same size that is necessarily different to the one we are
considering). So, there is a minimum i, 1 ≤ i < n, and there is a number of occurrences k ≥ 2
such that

- u = w[r[i]..r[i] + |u| - 1],

- LCP[i] = LCP[i + 1] = ... = LCP[i + k - 2],

- (LCP[i] > LCP[i - 1] or i = 1) and (LCP[i] > LCP[i + k - 1] or i + k - 1 = n - 1). □

Also, if a substring of w is not supermaximal to the left, there are at least two extensions 
by one symbol to the left that are equal. 

Proposition 18. A substring u in w is supermaximal to the left if and only if it occurs at
least twice but at most |A| times in w and all its extensions by one symbol to the left are
pairwise different.

Proof. Supermaximality to the left requires that the previous symbols of the addressed
suffixes be pairwise different. If any two of these suffixes had the same previous symbol, then
there would be an extension to the left that is repeated twice. □

Proposition 19. A substring of a given string is a supermaximal repeat if and only if it is 
supermaximal to the right and to the left. 
The algorithm findsmaxr first identifies the candidates that are supermaximal to the right
by finding the set of consecutive positions in r that have local maxima in LCP. Then it
filters out the candidates that are not supermaximal to the left. To check supermaximality 
to the left, we construct the set of previous symbols (symbols that occur immediately before 
each suffix on the set). Since the universe of this set is the size of the alphabet A, it can easily 
be efficiently implemented in a boolean array of the size of A. We take as an extra parameter 
an integer ml which is the minimum length of a supermaximal repeat to be reported (it can 
be set to 1 if desired). The pseudocode is given in Algorithm 4. 
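
The two filtering steps can be sketched in runnable form. This is an independent formulation written for clarity, not a transcription of Algorithm 4: it scans runs of equal LCP values, treats the two borders of the array explicitly, and uses a Python set for the previous symbols where the text suggests a boolean array of size |A|. The suffix array and LCP values are built naively, lists are 0-based internally, and reported positions are 1-based as in the text.

```python
def findsmaxr(w, ml=1):
    """Return [(repeat, sorted 1-based positions)] for all supermaximal repeats of w."""
    n = len(w)
    r = sorted(range(n), key=lambda s: w[s:])            # naive 0-based suffix array

    def common_prefix(a, b):
        h = 0
        while a + h < n and b + h < n and w[a + h] == w[b + h]:
            h += 1
        return h

    lcp = [common_prefix(r[i], r[i + 1]) for i in range(n - 1)]
    found = []
    i = 0
    while i < len(lcp):
        j = i
        while j + 1 < len(lcp) and lcp[j + 1] == lcp[i]:  # extend the run of equal values
            j += 1
        left_ok = i == 0 or lcp[i - 1] < lcp[i]
        right_ok = j == len(lcp) - 1 or lcp[j + 1] < lcp[i]
        if left_ok and right_ok and lcp[i] >= ml:
            starts = r[i:j + 2]                           # suffixes sharing the local maximum
            # Previous symbols must be pairwise different; None stands for "no previous symbol".
            prev = [w[s - 1] if s > 0 else None for s in starts]
            if len(set(prev)) == len(prev):
                u = w[starts[0]:starts[0] + lcp[i]]
                found.append((u, sorted(s + 1 for s in starts)))
        i = j + 1
    return found

print(findsmaxr("abcdeabcdfbcde"))
```

On the string of Example 2 the sketch reports abcd at positions 1 and 6, and bcde at positions 2 and 11.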

4.1 Correctness of findsmaxr 

Theorem 20 (Correctness). The algorithm findsmaxr computes all supermaximal repeats 
in the input string. 

Proof. Consider Algorithm 4. The main loop accesses sequentially the array LCP. By Proposition 19
supermaximal repeats are exactly the candidates that are supermaximal to the right
and to the left. Propositions 17 and 18 characterize these properties with conditions on the
data structures LCP and r. The algorithm checks exactly these conditions. □

4.2 Complexity of findsmaxr

Theorem 21 (Time complexity). Given the suffix array for an input string of length n,
findsmaxr computes all supermaximal repeats in time O(n).

Proof. The main loop iterates n - 2 times. Inside the main loop, all steps are clearly done in
O(1) except the construction of the set in the innermost if clause. This is done by inserting
each element into the set, represented as a boolean array, and maintaining the size appropriately.
Each insertion takes O(1). Thus, the total number of insertions over all constructions
of this set is the sum of the number of suffixes in each local maximum. Since no suffix can
be in two different local maxima, this total is less than n. □

Theorem 22 (Space complexity). For an input of size n, algorithm findsmaxr uses O(n)
memory space. More precisely, it uses n (2 word_size + log |A| + 2) + |A| + O(1) bits.

Proof. Consider Algorithm 4. The whole input w is allocated in memory. Since it contains
n symbols, its memory usage is n log |A| bits. The data structures r and LCP are arrays
of length n whose elements are between 0 and n, and fit within the processor word. The
described array for the set A has |A| bits. As before, since |A| < n, the term log |A| is
considered O(1), as is the term word_size. □

5 Implementation and experimental results 

We implemented findmaxr and findsmaxr in C (ANSI C99), for 32-bit or 64-bit machines.

For findmaxr the memory space requirement is n (3 word_size -|- log \A\ +2) bits for an 
input of n. So, for A the ASCII code and storing indices in 32 bits variables this becomes a 
total memory requirement of 13.25n bytes. In a 64 bits processor and 8 Gb RAM installed, 
the tool runs inputs of size up to ~ 618 Mb, without any swapping. Somewhat larger inputs 
can also be run efficiently because some swapping does not affect the running time. Notice 
Table 1: Input data set used for comparison.

File     Description                                             Size (bytes)
linux    The Linux Kernel 2.6.31 tar file                        365 711 360
HS-ch1   Homo sapiens chromosome 1, from NCBI 36.49              251 370 600
ecoli    The file E.coli of the large Canterbury corpus            4 638 690
bible    The file bible.txt of the large Canterbury corpus         4 047 392
world    The file world192.txt of the large Canterbury corpus      2 473 400
a2M      The letter a repeated two million times                   2 000 000

The Canterbury corpus can be found at http://corpus.canterbury.ac.nz/descriptions/#cantrbry,
while the FASTA files for the NCBI 36.49 DNA human genome are downloadable at
ftp://ftp.ensembl.org/pub/release-49/fasta/homo_sapiens/dna/

Table 2: Input size in bytes and running time in seconds of the three processes.

File     Size (bytes)    SA (s)     findmaxr (s)   findsmaxr (s)
linux    365 711 360     671.785    275.834         66.587
HS-ch1   251 370 600     798.583    197.545        100.714
ecoli      4 638 690       4.586      2.543          1.636
bible      4 047 392       3.201      2.085          0.750
world      2 473 400       1.670      1.224          0.377
a2M        2 000 000       2.423     29.135          0.111

that in the main loop of Algorithm 3, the array / is only used one element at a time, so
it can easily be handled by swap memory.

For findsmaxr the memory space requirement is n(2 word_size + log|A| + 2) + |A| bits. Again,
for A the ASCII code and indices in 32 bits, the memory requirement is 9.25n bytes, which
allows inputs of approximately 885 MB in the same configuration.
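
The figures above can be rechecked with a small calculation. The following sketch is only illustrative: the function names bits_per_symbol and max_input_symbols are hypothetical, the additive |A| term of findsmaxr is ignored as negligible, and 8 GB is assumed to mean 8 * 2^30 bytes with all of it available to the program.

```c
#include <assert.h>
#include <stdint.h>

/* Bits needed per input symbol, following the formulas in the text:
   index_arrays * word_size + log|A| + 2.
   index_arrays is 3 for findmaxr and 2 for findsmaxr. */
static unsigned bits_per_symbol(unsigned index_arrays, unsigned word_size,
                                unsigned log_alphabet)
{
    return index_arrays * word_size + log_alphabet + 2;
}

/* Largest input size n (in symbols) such that n * bits fits in ram_bytes.
   Assumption: the whole RAM is usable and the |A| term is negligible. */
static uint64_t max_input_symbols(uint64_t ram_bytes, unsigned bits)
{
    assert(bits > 0);
    return ram_bytes * 8 / bits;
}
```

With word_size = 32 and log|A| = 8, bits_per_symbol gives 106 bits (13.25 bytes) for findmaxr and 74 bits (9.25 bytes) for findsmaxr. Dividing the result of max_input_symbols by 2^20 for 8 * 2^30 bytes of RAM gives the 618 and 885 quoted above, so those sizes are expressed in units of 2^20 bytes.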

We tested this implementation on large inputs, using an Intel(R) Core(TM)2 Duo E6300 (only
one core), running at 1.86 GHz with 8 GB RAM (DDR2-800) under 64-bit Ubuntu Linux.

The programs were compiled with the GCC compiler version 4.2.4, with option -O2 for
normal optimization. The reported times are user times, counting only the time consumed
by the algorithm, not including the time to load the input from disk. The input files used
are described in Table 1. The files are chosen to demonstrate the behavior of the program
for different kinds of natural data as well as degenerate cases. We used input sizes of the
order of magnitude of the calculated limit, validating the performance in those cases. See [1]
for experiments that push this to the actual limit.

Table 2 reports, for each case, the input size expressed in bytes and the running time
expressed in seconds of three processes: SA, our implementation of the suffix array construction
of [14]; findmaxr, the computation of all maximal repeats with ml = 1; and findsmaxr, the
computation of all supermaximal repeats with ml = 1. The times of findmaxr and findsmaxr do not
include the time for the construction of the suffix array. To obtain the time to compute the
repeats from scratch, add the SA time to the time of the corresponding algorithm.
5.1 Some remarks

The first step in our algorithms is the computation of the suffix array of the input string.
We chose the algorithm of [14], which exploits the fact that the sorting is done on
suffixes and not on arbitrary strings. It achieves a fast O(n log n) time and requires O(n log n)
bits of memory if the bit representation of each integer is accounted for.
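
To fix ideas about what this first step produces, here is a deliberately naive sketch of a suffix array construction. It is not the algorithm of [14]: it sorts the suffix start positions with the standard qsort and a direct suffix comparison, so it ignores the structure that [14] exploits and can take O(n^2 log n) time on highly repetitive inputs such as a2M. The names sa_text, sa_len, cmp_suffix and naive_suffix_array are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Text shared with the comparator (qsort has no user-data argument). */
static const unsigned char *sa_text;
static size_t sa_len;

/* Lexicographic order on suffixes; a proper prefix sorts first. */
static int cmp_suffix(const void *a, const void *b)
{
    size_t i = *(const size_t *)a;
    size_t j = *(const size_t *)b;
    size_t li = sa_len - i;   /* length of the suffix starting at i */
    size_t lj = sa_len - j;   /* length of the suffix starting at j */
    int c = memcmp(sa_text + i, sa_text + j, li < lj ? li : lj);
    if (c != 0)
        return c;
    return (li > lj) - (li < lj);
}

/* Fills sa[0..n-1] with the start positions of the suffixes of text,
   in increasing lexicographic order. Naive, for illustration only. */
static void naive_suffix_array(const unsigned char *text, size_t n, size_t *sa)
{
    assert(sa != NULL || n == 0);
    for (size_t k = 0; k < n; k++)
        sa[k] = k;
    sa_text = text;
    sa_len = n;
    qsort(sa, n, sizeof *sa, cmp_suffix);
}
```

For the string banana, the sorted suffixes are a, ana, anana, banana, na, nana, so the resulting array is 5, 3, 1, 0, 4, 2 (positions counted from 0). The sketch stores each position in a size_t; an implementation that stores positions in 32-bit variables, as in the memory figures above, uses less space.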

An information-theoretic argument shows that O(n log|A|) bits are required to represent
the permutation given by the suffix array, because there are |A|^n different strings of length n
over the alphabet A; hence, there are at most that many different suffix arrays, and we need
Ω(n log|A|) bits to distinguish between them. With suffix arrays, n log n bits are used for this
instead. Compressed suffix arrays were used to get closer to the lower bound ([3, 5, 13]), but
at the cost of increasing the computational time [2] (for instance, when mapping substrings
with many occurrences).

The algorithm of Lippert [15] is based on a compressed suffix array and finds all repeats
of a given length in the input string. In contrast to our algorithm, Lippert's does not indicate
whether a repeat is maximal. His experiments show that his solution requires much more
time than ours.

The problem of finding repeats in large inputs occurs in genomic sequence analysis. A list
of the most popular repeat finders for genomic sequences appears in the survey [19]. Leaving
aside heuristic and library-based methods such as RepeatMasker (2009), existing methods
based on the suffix array still use the length of repeats as a parameter at each pass. The
algorithm of [17] constructs a suffix array after first building a suffix tree of the input, hence
yielding a very poor performance in terms of time and memory. A popular tool is REPuter
[12, 11]. It allows a very limited input size, and its memory requirement depends on the repeat
length and the number of occurrences in the input. In the worst case, inputs are limited to
RAM size/45. Since the output given by REPuter is not factorized, it becomes very large,
needing O(n^2) space for inputs of size n.